Skip to content

perf(unbundle): skip re-emit for already-indexed children (PE-9098)#740

Open
arielmelendez wants to merge 2 commits into
developfrom
PE-9098-parser-skip-already-indexed-children
Open

perf(unbundle): skip re-emit for already-indexed children (PE-9098)#740
arielmelendez wants to merge 2 commits into
developfrom
PE-9098-parser-skip-already-indexed-children

Conversation

@arielmelendez

Copy link
Copy Markdown
Contributor

Summary

When the bundle repair worker re-queues a bundle whose previous parse had some children dropped at the data-item indexer's queue cap, the parser used to re-emit every child — including the 90%+ already indexed. Each redundant emit consumed a slot in the dataItemIndexer / ans104DataIndexer queues, contending with the genuinely-new children for the cap. Under sustained backfill load this is the dominant contributor to the 2M+ items dropped per hour we observed on the Turbo gw2 indexer after PE-9098 landed.

This PR adds a skip-set: before parsing, the Ans104Unbundler queries BundleIndex.getIndexedChildIds(parentId) for ids already in new_data_itemsstable_data_items under this parent. The set is passed through to the parser worker. For each parsed child whose id is in the set, the parser counts it toward matchedItemCount (preserving the bundle's matched_data_item_count invariant — see below) and skips the filter eval, hash, and DATA_ITEM_MATCHED emit.

Why count skipped items toward matchedItemCount

upsertBundle updates matched_data_item_count with IFNULL(@matched, matched) — passing a value replaces it. The fully-indexed predicate is strict equality between bundle_data_items row count and matched_data_item_count. If we lowered the emitted matched count by the skip amount, the new lower value would replace the existing one, and the equality would never hold afterwards (rows > matched, so fully_indexed_at never sets and the bundle stays in the repair pool forever). Counting skips toward matchedItemCount avoids that.

Assumption: the unbundle/index filter has not changed between parses. When filters do change, the filter-reprocess path clears the bundle's state separately and skipChildIds will be empty for those bundles — so this assumption is enforced at the workflow layer, not implicit.

Failure handling

If getIndexedChildIds throws (DB lock, breaker open, etc.) we log and proceed without a skip-set. Worst case is the pre-fix behavior of redundant emits — not a regression, not a hard failure.

Test plan

  • yarn build clean
  • yarn lint:check clean on changed files
  • yarn test:file src/workers/ans104-unbundler.test.ts src/workers/bundle-repair-worker.test.ts — 12 tests pass
  • Deploy to gw2 (which is already running the predecessor PR): expect data_items_dropped_total{queue_name="dataItemIndexer"} and {queue_name="ans104DataIndexer"} rates to drop dramatically (skip rate on repair-driven re-parses should also fall as queue pressure decreases). New alreadyIndexedSkippedCount is included in the UNBUNDLE_COMPLETE message; surfacing it as a metric is a follow-up.

Notes

  • Base branch is PE-9098-bundle-repair-bdi-routing (the previous PR). Once that lands, this PR's diff will be just the skip-set change.
  • Five-edit DB plumbing for selectIndexedChildIds + getIndexedChildIds: SQL, worker impl, queue wrapper, message handler, BundleIndex interface.
  • The new alreadyIndexedSkippedCount value flows out of the parser worker through UNBUNDLE_COMPLETE but isn't yet wired to a Prometheus counter. A small follow-up should add one so we can verify the optimization from Grafana.

🤖 Generated with Claude Code

When the bundle repair worker re-queues a bundle (because some of its
children were dropped at the data-item indexer's queue cap on a prior
attempt), the parser previously re-emitted EVERY child — including the
90%+ that were already indexed. Each redundant emit consumed a slot in
the dataItemIndexer / ans104DataIndexer queues, contending with the
genuinely-new children for the cap. Under sustained backfill load this
was the dominant contributor to the 2M+ items dropped per hour
observed on the Turbo gw2 indexer.

The fix: before parsing, the Ans104Unbundler queries the BundleIndex
for the set of ids already present in new_data_items or
stable_data_items under this parent. The set is passed to the parser
worker. For each parsed item whose id is in the set, the parser
increments matchedItemCount (preserving the bundle's
matched_data_item_count invariant — see below) and skips the
filter eval, hash, and DATA_ITEM_MATCHED emit. Net effect: redundant
queue traffic disappears and the downstream queues breathe.

matched_data_item_count invariant: upsertBundle uses
`IFNULL(@matched, matched)` to update the column, and the
fully-indexed predicate is strict equality between bundle_data_items
row count and matched_data_item_count. If we lowered the emitted
matched count by the skip amount, the new lower value would replace
the existing one and the equality would never hold (rows >
matched_data_item_count from then on, so fully_indexed_at never sets).
Counting skips toward matchedItemCount avoids this. Assumption: the
unbundle/index filter has not changed between parses — when filters
do change, the filter-reprocess path clears the bundle separately and
skipChildIds is empty for those.

Failure handling: if getIndexedChildIds throws, we log and proceed
without a skip-set rather than failing the unbundle. Worst case is
the pre-fix behavior of redundant emits, not a regression.

Five-edit DB plumbing for `selectIndexedChildIds` + `getIndexedChildIds`:
SQL, worker impl, queue wrapper, message handler, BundleIndex interface.

Tests cover three cases:
- no bundleIndex provided -> skipChildIds undefined (unchanged behavior)
- bundleIndex returns ids -> skipChildIds passed to parser
- bundleIndex throws -> skipChildIds undefined, unbundle still proceeds

Companion to PE-9098-bundle-repair-bdi-routing (the previous PR);
together they address the BDI repair pipeline end-to-end.
@codecov

codecov Bot commented May 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 53.15315% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.14%. Comparing base (04d09ef) to head (f749766).

Files with missing lines Patch % Lines
src/lib/ans-104.ts 0.00% 41 Missing ⚠️
src/database/standalone-sqlite.ts 28.57% 10 Missing ⚠️
src/workers/ans104-unbundler.ts 97.67% 1 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                          @@
##           PE-9098-bundle-repair-bdi-routing     #740      +/-   ##
=====================================================================
- Coverage                              78.20%   78.14%   -0.07%     
=====================================================================
  Files                                    121      121              
  Lines                                  42799    42908     +109     
  Branches                                3225     3230       +5     
=====================================================================
+ Hits                                   33471    33530      +59     
- Misses                                  9286     9336      +50     
  Partials                                  42       42              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Two refinements to the parser-skip-already-indexed work, addressing
feedback from review:

1. **Tighten the predicate to `data_offset IS NOT NULL`** on both
   `new_data_items` and `stable_data_items`. The optimistic insert
   path (insertOrIgnoreNewDataItem, used by the admin
   `/queue-data-item` route) intentionally leaves the eleven root-atom
   columns NULL — including `data_offset` — expecting a subsequent
   full parser-emit to fill them via the COALESCE/IFNULL clauses in
   upsertNewDataItem's `ON CONFLICT UPDATE`. Without this predicate,
   our skip-set would include optimistically-inserted rows and strand
   their tuple fields permanently NULL across re-parses. With it, we
   only skip rows whose tuple fields are already populated — the case
   where re-emit is a true no-op.

   Cost: the existing covering indexes `(parent_id, id)` no longer
   cover the query (data_offset isn't in the index). Each row pays one
   extra page read. For a 10K-child bundle, ~10K extra heap reads.
   Still fast (sub-100ms worst case). If this becomes hot we can add a
   partial index `(parent_id, id) WHERE data_offset IS NOT NULL`.

2. **Add `bundles_data_items_skipped_already_indexed_total`** counter.
   Wired from the `ANS104_UNBUNDLE_COMPLETE` event handler off the
   `alreadyIndexedSkippedCount` value the parser already includes in
   its completion message. No label — the only realistic value would
   be `bundle_format` and ANS-104 is the only format in production
   use; an unused dimension on a Prometheus metric is just clutter.

Tests unchanged (12 pass). Build + lint clean.
Base automatically changed from PE-9098-bundle-repair-bdi-routing to develop May 18, 2026 15:39
@arielmelendez

Copy link
Copy Markdown
Contributor Author

Might not ship this after all.

@vilenarios

Copy link
Copy Markdown
Contributor

Hey @arielmelendez — we're prepping a release and noticed your comment "might not ship this after all." Should we close PR #740, or do you want to revisit? Just want to align before cutting the next release. cc @vilenarios

@arielmelendez

Copy link
Copy Markdown
Contributor Author

Hey @arielmelendez — we're prepping a release and noticed your comment "might not ship this after all." Should we close PR #740, or do you want to revisit? Just want to align before cutting the next release. cc @vilenarios

@vilenarios this hasn't been integrated or adequately tested yet and it's not clear to me that it's yet necessary for optimal system performance in the real-life scenarios we're encountering. Let's let this one hang back a bit longer while we continue fine tuning the system in other areas a bit more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants